The goal of this project is to use machine learning techniques to calibrate data collected from Quant-AQ modular PM air-quality sensors against measurements from an outdoor gas sensor. The primary objective is to enable deployment of the modular PM sensors across diverse locations and environmental conditions while keeping measurements reliable as parameters such as temperature and relative humidity vary.
The dataset comes from Quant-AQ modular air-quality sensors installed at 3101 Market St, Philadelphia. There are three sensors: two modular PM sensors (alpha and beta) and one gas sensor (outdoor). The relevant data variables are described below:
| Column | Units | Description |
|---|---|---|
| timestamp | | Sample timestamp in ISO format |
| timestamp_local | | Timestamp corrected to the time zone defined in the device settings |
| id | | Unique ID of the record |
| sn | | Device serial number |
| sample_rh | | Sample relative humidity |
| sample_pres | mmHg | Sample pressure |
| lat | degrees | Latitude of the device |
| lon | degrees | Longitude of the device |
| pm1 | µg/m³ | PM1 value |
| pm25 | µg/m³ | PM2.5 value |
| pm10 | µg/m³ | PM10 value |
| pm1_model_id | | Model ID corresponding to PM1 |
| pm25_model_id | | Model ID corresponding to PM2.5 |
| pm10_model_id | | Model ID corresponding to PM10 |
While exploring the dataset, we renamed certain variables in the alpha and beta sensor data to match those of the outdoor sensor. We also reset the seconds component of the time-based index to 0: each row holds a 1-minute average concentration, but several rows were written at different seconds within the minute. A few further points are worth noting. First, the datetime column is in reverse order. Second, because we are working with three datasets and want to calibrate alpha to the outdoor sensor and beta to the outdoor sensor, we need to pick the timesteps common to the datasets; the differing row counts indicate that some readings are missing. Third, we need to check for null values in the relevant parameters: PM1, PM2.5, and PM10 had a null row, which we dropped before the analysis. Finally, performing a regression requires a train-test split and a way to evaluate the model.
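The time alignment and null handling described above could be sketched as follows. This is a minimal illustration with pandas, not the project's actual code: the helper names `align_to_minute` and `common_rows`, the toy timestamps, and the choice to keep the first row when two readings fall in the same minute are all assumptions.

```python
import pandas as pd

def align_to_minute(df: pd.DataFrame, time_col: str = "timestamp") -> pd.DataFrame:
    """Index by timestamp with seconds reset to 0, sorted oldest-first."""
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col]).dt.floor("min")
    # Assumption: if two rows land in the same minute, keep the first one.
    out = out.drop_duplicates(subset=time_col)
    return out.set_index(time_col).sort_index()

def common_rows(sensor: pd.DataFrame, reference: pd.DataFrame, cols):
    """Drop rows with nulls in `cols`, then keep timestamps present in both frames."""
    sensor = sensor.dropna(subset=cols)
    reference = reference.dropna(subset=cols)
    shared = sensor.index.intersection(reference.index)
    return sensor.loc[shared], reference.loc[shared]

# Made-up rows in reverse time order, written at different seconds.
alpha = pd.DataFrame({
    "timestamp": ["2000-10-01T00:02:17", "2000-10-01T00:01:05", "2000-10-01T00:00:42"],
    "pm1": [3.0, None, 1.0],
})
outdoor = pd.DataFrame({
    "timestamp": ["2000-10-01T00:02:03", "2000-10-01T00:00:09"],
    "pm1": [2.5, 1.2],
})

alpha_common, outdoor_common = common_rows(
    align_to_minute(alpha), align_to_minute(outdoor), ["pm1"]
)
```

After this step both frames share the same minute-resolution index, so rows can be compared position by position.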
RH is not normally distributed, so we used log(RH) as a predictor in the multivariable models. To address our objective, the outcome variables are the outdoor sensor parameters (PM1, PM2.5, and PM10 concentrations), while the predictor variables are the corresponding parameters from the alpha and beta sensors, with each sensor modeled separately.
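One way to assemble the predictors and outcome described above is sketched below. The helper `build_xy` and the toy values are hypothetical; the column names `rh` and `temp` are assumed to be the renamed humidity and temperature columns, and the frames are assumed to be already aligned on a shared index.

```python
import numpy as np
import pandas as pd

def build_xy(sensor: pd.DataFrame, reference: pd.DataFrame, species: str):
    """Predictors from the PM sensor, outcome from the outdoor sensor."""
    X = pd.DataFrame(
        {
            species: sensor[species],
            "log_rh": np.log(sensor["rh"]),  # natural log; RH must be > 0
            "temp": sensor["temp"],
        }
    )
    y = reference[species].rename(f"{species}_outdoor")
    return X, y

# Made-up, already-aligned rows for illustration only.
alpha = pd.DataFrame(
    {"pm25": [4.0, 6.5, 9.1], "rh": [45.0, 60.0, 80.0], "temp": [18.0, 16.5, 14.0]}
)
outdoor = pd.DataFrame({"pm25": [3.2, 5.0, 7.4]})

X, y = build_xy(alpha, outdoor, "pm25")
```

The same call would be repeated per species and per sensor, giving one predictor matrix and one outcome vector for each calibration.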
Correlations of dfBet and dfAlp with dfMod:

| Variable | dfBet vs. dfMod | dfAlp vs. dfMod |
|---|---|---|
| pm1 | 0.969138 | 0.963401 |
| pm25 | 0.968789 | 0.963212 |
| pm10 | 0.730997 | 0.762527 |
| rh | 0.999919 | 0.999899 |
| temp | 0.998878 | 0.998542 |
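Column-wise correlations of this kind can be obtained with pandas `DataFrame.corrwith`, which computes a Pearson coefficient for each pair of matching columns. The sketch below assumes the frames share an index and uses made-up values, so its output does not reproduce the numbers above.

```python
import pandas as pd

cols = ["pm1", "pm25", "pm10", "rh", "temp"]

def sensor_correlations(sensor: pd.DataFrame, reference: pd.DataFrame, columns=cols) -> pd.Series:
    """Pearson r per column between two frames aligned on the same index."""
    return sensor[columns].corrwith(reference[columns])

# Toy reference frame and a sensor frame that is an exact linear transform of it.
dfMod = pd.DataFrame(
    {
        "pm1": [1.0, 2.0, 3.0, 4.0],
        "pm25": [2.0, 4.0, 6.0, 8.0],
        "pm10": [5.0, 7.0, 6.0, 9.0],
        "rh": [40.0, 50.0, 60.0, 70.0],
        "temp": [10.0, 12.0, 14.0, 16.0],
    }
)
dfBet = dfMod * 1.1 + 0.5

r = sensor_correlations(dfBet, dfMod)
```

Because the toy sensor frame is a linear rescaling of the reference, every coefficient here is 1 up to floating-point error; real sensor data would give values below 1.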
The methodology consists of the following steps:
1. Data collection
2. Data preprocessing
3. Feature selection
4. Model selection
5. Model training
6. Calibration and prediction
7. Performance evaluation
8. Optimization
9. Validation and testing
This methodology aims to address the calibration challenge for Quant-AQ modular PM air-quality sensors, providing a robust and adaptable solution for accurate air quality measurements across different settings.
A time series of the training data:
| Species | P-value (alpha) | P-value (beta) |
|---|---|---|
| PM₁ | 2.719736226298462e-51 | 8.097760848647828e-07 |
| PM₂.₅ | 9.849494170512791e-49 | 1.5673137466827215e-05 |
| PM₁₀ | 2.2170139363403661e-16 | 8.281707208895426e-06 |
Time series calibration from linear regression:
| Species | P-value (alpha) | P-value (beta) |
|---|---|---|
| PM₁ | 4.3904015529516365e-06 | 0.08180089864504025 |
| PM₂.₅ | 2.264099930127344e-06 | 0.06224899867334791 |
| PM₁₀ | 0.8967040709134199 | 0.21997416167439787 |
Calibration time series on the test data for the random forest model. In addition to the variable of interest from the alpha or beta sensor, the predictors include log RH and temperature; the target is the corresponding variable from the outdoor gas sensor:

| Species | P-value (alpha) | P-value (beta) |
|---|---|---|
| PM₁ | 3.1191276275135696e-41 | 4.0175697383781956e-06 |
| PM₂.₅ | 1.103623188723264e-40 | 1.164108947121631e-05 |
| PM₁₀ | 0.00012038761616026669 | 0.05300182382762207 |
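A random forest calibration with these predictors might look like the sketch below, written with scikit-learn. Everything in it is an assumption made for illustration: the data are synthetic, the hyperparameters are arbitrary, and the chronological 80/20 split (`shuffle=False`) is one possible choice rather than the split used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: "truth" plays the outdoor sensor, "raw" the PM sensor.
rng = np.random.default_rng(0)
n = 500
truth = rng.uniform(2, 30, n)
rh = rng.uniform(30, 90, n)
temp = rng.uniform(5, 25, n)
raw = 1.4 * truth + 0.05 * (rh - 30) + rng.normal(0, 0.5, n)  # biased, RH-dependent reading

X = pd.DataFrame({"pm25": raw, "log_rh": np.log(rh), "temp": temp})
y = pd.Series(truth, name="pm25_outdoor")

# Keep time order: the first 80% of rows train the model, the last 20% test it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
calibrated = model.predict(X_test)

# Compare error before and after calibration on the held-out rows.
rmse_raw = float(np.sqrt(mean_squared_error(y_test, X_test["pm25"])))
rmse_cal = float(np.sqrt(mean_squared_error(y_test, calibrated)))
```

Comparing `rmse_raw` with `rmse_cal` on the held-out rows gives a simple check that the calibration brings the sensor closer to the reference.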
Calibration output for XGBoost, using the same predictor variables as the random forest and MLR models:

| Species | P-value (alpha) | P-value (beta) |
|---|---|---|
| PM₁ | 0.23020254308289304 | 0.7554758642896344 |
| PM₂.₅ | 0.3210355868051801 | 0.7269915190286456 |
| PM₁₀ | 1.748584892922372e-07 | 5.140038709752224e-06 |
Calibration output for MLR:
| Species | P-value (alpha) | P-value (beta) |
|---|---|---|
| PM₁ | 0.07217912994271163 | 0.9829651276451719 |
| PM₂.₅ | 0.08110162431725541 | 0.921411076940686 |
| PM₁₀ | 0.15980401241363587 | 0.07274334506097305 |
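For reference, a multiple linear regression (MLR) calibration fits an intercept plus one coefficient per predictor by ordinary least squares. The sketch below uses NumPy only; the helpers `fit_mlr` and `predict_mlr` and the noise-free synthetic data are assumptions for illustration, not the project's implementation.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Apply fitted coefficients to a predictor matrix."""
    return coef[0] + np.asarray(X) @ coef[1:]

# Synthetic predictors: sensor PM, log RH, temperature (made-up ranges).
rng = np.random.default_rng(1)
pm = rng.uniform(2, 30, 50)
log_rh = np.log(rng.uniform(30, 90, 50))
temp = rng.uniform(5, 25, 50)
X = np.column_stack([pm, log_rh, temp])

# Noise-free outcome with known coefficients, so the fit can be checked exactly.
y = 1.0 + 0.7 * pm + 2.0 * log_rh - 0.1 * temp

coef = fit_mlr(X, y)
y_hat = predict_mlr(coef, X)
```

With real data the residuals would not vanish, and a linear fit cannot capture nonlinear humidity or temperature effects.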
1. Data Variability:
Air quality sensor readings are highly susceptible to fluctuations caused by changing environmental conditions: temperature and humidity can influence both the accuracy and the consistency of sensor measurements. Because pollutant concentrations themselves often change with temperature and humidity, and these relationships may not be linear, it is difficult to create a one-size-fits-all calibration model that adapts to diverse environmental scenarios. The model needs to distinguish genuine changes in air quality from changes induced by external environmental factors.
Potential Impact: Failure to address data variability can lead to inaccurate calibration, resulting in unreliable air quality measurements. It may also hinder the model's generalization capability when deployed in locations with different environmental characteristics.
2. Sensor Drift:
Sensor drift refers to the gradual deviation of sensor readings from their initial calibrated state over time. Even after initial calibration, modular air quality sensors may experience gradual shifts in their performance characteristics due to factors such as sensor aging, exposure to environmental elements, or changes in internal components. If not accounted for, drift introduces systematic errors into the measurements, reducing the reliability of the collected data and making the calibration less effective over time.
Potential Impact: Unmitigated sensor drift can compromise the accuracy and longevity of the calibration model. This challenge necessitates continuous monitoring and recalibration strategies to correct for any deviations and maintain the reliability of the sensor data.
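One simple form of continuous monitoring is to track the rolling mean difference between calibrated readings and the reference, and to flag periods where it exceeds a tolerance. The sketch below is only an illustration of that idea: the helper `drift_flags`, the 7-day window, the 2 µg/m³ threshold, and the daily toy series are all hypothetical choices.

```python
import numpy as np
import pandas as pd

def drift_flags(calibrated: pd.Series, reference: pd.Series,
                window: str = "7D", threshold: float = 2.0) -> pd.Series:
    """True where the rolling mean residual exceeds the threshold (µg/m³)."""
    residual = calibrated - reference
    rolling_bias = residual.rolling(window).mean()  # needs a sorted DatetimeIndex
    return rolling_bias.abs() > threshold

# Made-up daily means: a constant reference and a sensor drifting 0.2 µg/m³ per day.
days = pd.date_range("2000-10-01", periods=30, freq="D")
reference = pd.Series(10.0, index=days)
calibrated = pd.Series(10.0 + 0.2 * np.arange(30), index=days)

flags = drift_flags(calibrated, reference)
```

A run of flagged days would then be a prompt to recalibrate against the reference sensor.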
3. Limited Training Data:
Training machine learning calibration models requires a substantial amount of labeled data. For air quality sensors, such data may be limited, which hinders the model's learning capacity. An effective calibration model must be exposed to diverse scenarios, but a comprehensive dataset covering varied environmental conditions and locations is hard to obtain, and a model trained on limited data may struggle to generalize, particularly in unusual deployment settings. This study used only October data, so the training set is itself limited.
Potential Impact: The model's inability to generalize due to limited training data may result in suboptimal performance, especially in environments not well-represented in the training set. This challenge underscores the importance of exploring techniques like transfer learning and data augmentation to enhance model adaptability.
The MLR model performed worst among the models: none of its results were statistically significant (all p-values > 0.05), which was also apparent in the MLR calibrated time-series plots. The random forest produced statistically significant results for both sensors and all variables except PM10 on the beta sensor, which had a p-value of 0.053 and is therefore still significant at the 0.1 level. Again, the PM10 readings from the modular sensors are less well correlated with the gas sensor. Overall, the random forest performed best among all models, judging by the p-values and the time-series plots. XGBoost performed better for PM10, with results significant at all conventional significance levels, but performed poorly for PM1 and PM2.5.